What makes a movie a successful blockbuster? Why do certain movies immediately catch the public's eye? What does a successful movie look like? These questions have long interested movie enthusiasts, and in recent years they have also become a research area for data scientists and analysts. Among the widely used movie rating systems, IMDB seems to be the most prevalent among critics and general viewers alike. In this report, the IMDB dataset from the Kaggle community is used to provide deeper insights into different aspects of successful movies. In particular, using Python and relevant libraries for data analytics and visualisation, this report tries to answer the following questions:
To address the questions above, the report is organised as follows: 2. Data Pre-processing, 3. Directors/Actors vs. Gross Earnings, 4. Movies Comparisons, 5. Distribution of the Gross Earnings, 6. Genres Analysis, 7. IMDB Scores vs. Other Variables, and 8. Conclusion.
First, the relevant libraries are imported so that the analysis can be carried out in the following steps. NB: the package missingno is used to visualise the missing values in the dataset. For more information, see the missingno documentation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
The dataset movie_metadata.csv is imported as the pandas dataframe df. In short, it contains 28 variables and 5043 observations. A detailed summary of the dataframe df is given below: the variables' descriptions, followed by the number of non-null values and the datatype of each variable.
df = pd.read_csv('movie_metadata.csv')
# dimensions of the original dataframe
df.shape
| No | Variable | Description |
|---|---|---|
| 1 | movie_title | Title of the Movie |
| 2 | duration | Duration in minutes |
| 3 | director_name | Name of the Director of the Movie |
| 4 | director_facebook_likes | Number of likes of the Director on his Facebook Page |
| 5 | actor_1_name | Primary actor starring in the movie |
| 6 | actor_1_facebook_likes | Number of likes of the Actor_1 on his/her Facebook Page |
| 7 | actor_2_name | Other actor starring in the movie |
| 8 | actor_2_facebook_likes | Number of likes of the Actor_2 on his/her Facebook Page |
| 9 | actor_3_name | Other actor starring in the movie |
| 10 | actor_3_facebook_likes | Number of likes of the Actor_3 on his/her Facebook Page |
| 11 | num_user_for_reviews | Number of users who gave a review |
| 12 | num_critic_for_reviews | Number of critical reviews on imdb |
| 13 | num_voted_users | Number of people who voted for the movie |
| 14 | cast_total_facebook_likes | Total number of facebook likes of the entire cast of the movie |
| 15 | movie_facebook_likes | Number of Facebook likes in the movie page |
| 16 | plot_keywords | Keywords describing the movie plot |
| 17 | facenumber_in_poster | Number of actors featured on the movie poster |
| 18 | color | Film colorization. ‘Black and White’ or ‘Color’ |
| 19 | genres | Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’ |
| 20 | title_year | The year in which the movie is released (1916:2016) |
| 21 | language | English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc |
| 22 | country | Country where the movie is produced |
| 23 | content_rating | Content rating of the movie |
| 24 | aspect_ratio | Aspect ratio the movie was made in |
| 25 | movie_imdb_link | IMDB link of the movie |
| 26 | gross | Gross earnings of the movie in Dollars |
| 27 | budget | Budget of the movie in Dollars |
| 28 | imdb_score | IMDB Score of the movie on IMDB |
# detailed info of the dataframe
df.info()
For the purpose of the analysis, non-relevant columns, i.e. movie_imdb_link & plot_keywords, are dropped.
df = df.drop(['movie_imdb_link', 'plot_keywords'], axis=1);
# dimension of the dataframe after dropping the unwanted columns
df.shape
The dataframe df contains 45 duplicated observations. These are dropped to avoid errors in the subsequent calculations and aggregations.
# counting the number of duplicated rows
df.duplicated().value_counts()
# dropping the duplicates
df.drop_duplicates(inplace=True)
# dimension of the dataframe after dropping the duplicates
df.shape
The missing values are counted and visualised below.
Observations having missing values are then dropped.
# number of missing values in each column
df.isnull().sum()
# the distribution of the missing values in the dataframe df
msno.matrix(df);
# barplot of the missing values in the dataframe df
msno.bar(df);
# dropping the missing values
df.dropna(inplace=True)
# dimension of the dataframe after dropping the missing values
df.shape
All values in the column movie_title end with the character '\xa0' (a non-breaking space). These characters are therefore removed from movie_title.
df['movie_title'] = df['movie_title'].str.replace('\xa0','')
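As a minimal sketch of why this cleaning step matters, the toy series below (made-up rows, not read from the dataset) shows that a title with a trailing '\xa0' does not compare equal to the plain title until the character is removed. The use of `str.strip()` as an alternative is an assumption added here for illustration; it removes only leading and trailing whitespace, which includes the non-breaking space.

```python
import pandas as pd

# toy titles with a trailing non-breaking space, mimicking the raw column (illustrative rows)
titles = pd.Series(['Avatar\xa0', 'Spectre\xa0'])

# before cleaning, an exact comparison with the plain title finds nothing
found_before = bool((titles == 'Avatar').any())

# approach used in this report: remove every '\xa0' character
cleaned_replace = titles.str.replace('\xa0', '', regex=False)

# alternative (assumption): strip leading/trailing whitespace only
cleaned_strip = titles.str.strip()

found_after = bool((cleaned_replace == 'Avatar').any())

print(found_before, found_after)
print(cleaned_replace.tolist())
print(cleaned_strip.tolist())
```

Without this step, the title filters used later in the movie comparison would return no rows.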
The dataframe df contains 1669 unique director names. To calculate the total gross earnings of individual directors, the function groupby is used to sum the gross earnings for each director. The top 30 most successful directors are then visualised. In addition, the distribution of gross earnings is investigated to give a bigger picture of how unevenly 'success' is spread among individual directors in the movie industry.
# number of unique directors
df['director_name'].nunique()
# creating a pandas series storing sum of the gross earnings for each director in the descending order
# (the column 'gross' is selected before summing so that non-numeric columns are not aggregated)
director = (df.groupby('director_name')['gross'].sum()/1e9).sort_values(ascending = False)
# visualising the top 30 most successful directors
plot1 = director[:30].plot.bar(rot=90, title='Top 30 most successful directors with respect to gross earnings of the movies', figsize = (16, 5))
plot1.set_xlabel("Director's Name")
plot1.set_ylabel("Gross earnings of the movies (Billion Dollars)");
The Lorenz curve is a graphical representation of the distribution of income or wealth. Closely associated with the Lorenz curve, the Gini coefficient is the ratio of the area between the line of perfect equality and the observed Lorenz curve to the area between the line of perfect equality and the line of perfect inequality. The higher the coefficient, the more unequal the distribution. More specifically, the Gini coefficient ranges from 0 (perfect equality) to 1 (perfect inequality). These two concepts are used to visualise and quantify the distribution of gross earnings among directors.
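To make the two definitions concrete, here is a small sketch on made-up numbers. The helper names `toy_lorenz` and `toy_gini` are hypothetical and exist only for this illustration; `toy_gini` uses the sorted-values formula 2·Σ(i·y_i)/(n·Σy) − (n+1)/n, which is assumed here to be a suitable discrete estimate of the area ratio described above.

```python
import numpy as np

def toy_lorenz(values):
    """Cumulative share of the total, from poorest to richest, starting at 0."""
    y = np.sort(np.asarray(values, dtype=float))
    return np.insert(y.cumsum() / y.sum(), 0, 0.0)

def toy_gini(values):
    """Gini coefficient from sorted values: 2*sum(i*y_i)/(n*sum(y)) - (n+1)/n."""
    y = np.sort(np.asarray(values, dtype=float))
    n = y.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * y) / (n * y.sum()) - (n + 1.0) / n

# made-up earnings: four equal earners vs. one earner taking everything
equal = [5, 5, 5, 5]
unequal = [0, 0, 0, 20]

print(toy_lorenz(equal))    # follows the diagonal
print(toy_lorenz(unequal))  # stays at 0 until the last earner
print(toy_gini(equal), toy_gini(unequal))
```

With only four observations the most unequal case gives 0.75 rather than 1, because this finite-sample formula has a maximum of (n − 1)/n.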
# gross earnings per director, sorted from poorest to richest
DIR = np.sort(np.array(director))
# cumulative share of gross earnings, starting from 0
DIR_lorenz = DIR.cumsum() / DIR.sum()
DIR_lorenz = np.insert(DIR_lorenz, 0, 0)
# establishing function calculating Gini coefficient
def gini(arr):
    ## first sort
    sorted_arr = arr.copy()
    sorted_arr.sort()
    n = arr.size
    coef_ = 2. / n
    const_ = (n + 1.) / n
    weighted_sum = sum([(i+1)*yi for i, yi in enumerate(sorted_arr)])
    return coef_*weighted_sum/(sorted_arr.sum()) - const_
# calculating the gini coefficient of the gross earnings distribution of the directors
gini(DIR)
# setting the size of the plot
plt.figure(figsize=(8,8))
# plotting the diagonal line of equality
plt.plot([0,1], [0,1], color = 'k', linestyle='dashed', label='Line of perfect equality')
# plotting the Lorenz curve: cumulative share of directors (x) against cumulative share of gross earnings (y)
plt.scatter(np.arange(DIR_lorenz.size)/(DIR_lorenz.size-1), DIR_lorenz, label='Actual gross distribution')
# setting the limits of x-axis and y-axis
plt.xlim([0,1])
plt.ylim([0,1])
# adding annotations
plt.xlabel('Cumulative share of directors from poorest to richest')
plt.ylabel('Cumulative share of gross earnings')
plt.title('LORENZ CURVE - DIRECTOR')
# labels are attached to the artists above, so the legend does not depend on drawing order
plt.legend(loc="upper left",
           prop={"size":13})
# plotting the plot
plt.show();
Conclusion: the graphical representation, together with the Gini coefficient of 0.746 for the distribution of gross earnings among directors, indicates a fairly large gap between the most successful directors and the least successful ones. In other words, a minority of the most successful directors accounts for a large proportion of the wealth (gross earnings), while the earnings of the majority of less successful directors are clearly outweighed by that minority. This may reflect the fact that the movie industry can be harsh and highly selective for directors when it comes to success.
The dataframe df has three columns for actors: actor_1_name, actor_2_name & actor_3_name. To calculate the total gross earnings of the movies for individual actors, it is first necessary to check whether a name is repeated across those three columns within any observation (movie). Checking for repetitions of the actors' names between the columns actor_1_name, actor_2_name & actor_3_name:
(df['actor_1_name'] == df['actor_2_name']).value_counts()
(df['actor_1_name'] == df['actor_3_name']).value_counts()
(df['actor_2_name'] == df['actor_3_name']).value_counts()
Conclusion: there are no repetitions of the actors' names among the columns actor_1_name, actor_2_name & actor_3_name within any film record. In other words, for one particular movie, an actor's name appears in only one of actor_1_name, actor_2_name or actor_3_name. Thus, to assess an actor's success with respect to the gross earnings of movies, the gross earnings must be summed over all of his/her roles, i.e. actor_1_name, actor_2_name & actor_3_name, in all movies in which he/she participated.
# unique actors' names for the column actor_1_name
actor1 = df['actor_1_name'].unique()
# unique actors' names for the column actor_2_name
actor2 = df['actor_2_name'].unique()
# unique actors' names for the column actor_3_name
actor3 = df['actor_3_name'].unique()
# all unique actors' names
unique_actors = np.unique(np.concatenate((actor1, actor2, actor3)));
# sorting A-Z all unique actors' names
unique_actors_sorted = np.sort(unique_actors)
# number of the unique names of the actors
len(unique_actors_sorted)
# total gross earnings of the movies associated with 'actor_1_name'
# (the column 'gross' is selected before summing so that non-numeric columns are not aggregated)
actor1_gross = df.groupby('actor_1_name')['gross'].sum()/1e9
# total gross earnings of the movies associated with 'actor_2_name'
actor2_gross = df.groupby('actor_2_name')['gross'].sum()/1e9
# total gross earnings of the movies associated with 'actor_3_name'
actor3_gross = df.groupby('actor_3_name')['gross'].sum()/1e9
# calculate total gross earnings for each of the actor in all of
# his/her positions, i.e. 'actor_1_name', 'actor_2_name' & 'actor_3 name'
temp = dict()
for i in unique_actors_sorted:
    if i in actor1_gross and i in actor2_gross and i in actor3_gross:
        temp[i] = actor1_gross[i] + actor2_gross[i] + actor3_gross[i]
    elif i in actor1_gross and i in actor2_gross:
        temp[i] = actor1_gross[i] + actor2_gross[i]
    elif i in actor1_gross and i in actor3_gross:
        temp[i] = actor1_gross[i] + actor3_gross[i]
    elif i in actor2_gross and i in actor3_gross:
        temp[i] = actor2_gross[i] + actor3_gross[i]
    elif i in actor1_gross:
        temp[i] = actor1_gross[i]
    elif i in actor2_gross:
        temp[i] = actor2_gross[i]
    else:
        temp[i] = actor3_gross[i]
# creating a pandas series storing sum of the gross earnings for each actor in the descending order
actor = pd.Series(temp).sort_values(ascending = False)
plot2 = actor[:30].plot.bar(rot=90, title='Most successful actors with respect to gross earnings of the movies', figsize = (16, 5))
plot2.set_xlabel("Actor's Name")
plot2.set_ylabel("Gross earnings of the movies (Billion Dollars)");
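The per-actor totals computed above can also be obtained without the explicit branching. The sketch below is an alternative formulation, not the method used in this report: it relies on `Series.add` with `fill_value=0` to align the three per-position sums on the actor's name. The actor labels and gross values are made up for illustration.

```python
import pandas as pd

# toy data: hypothetical actors and gross values (not from the dataset)
toy = pd.DataFrame({
    'actor_1_name': ['A', 'B', 'A'],
    'actor_2_name': ['B', 'C', 'D'],
    'actor_3_name': ['C', 'D', 'B'],
    'gross': [10.0, 20.0, 30.0],
})

# gross per actor in each billing position
g1 = toy.groupby('actor_1_name')['gross'].sum()
g2 = toy.groupby('actor_2_name')['gross'].sum()
g3 = toy.groupby('actor_3_name')['gross'].sum()

# align on the actor's name; a position the actor never held counts as 0
total = g1.add(g2, fill_value=0).add(g3, fill_value=0).sort_values(ascending=False)
print(total)
```

Here actor B appears once in each position and therefore collects the gross of all three toy movies, while actor A only appears as the first-billed actor.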
# gross earnings per actor, sorted from poorest to richest
ACT = np.sort(np.array(actor))
# cumulative share of gross earnings, starting from 0
ACT_lorenz = ACT.cumsum() / ACT.sum()
ACT_lorenz = np.insert(ACT_lorenz, 0, 0)
# calculating the gini coefficient for gross earnings distribution among actors
gini(ACT)
# setting the size of the plot
plt.figure(figsize=(8,8))
# plotting the diagonal line of equality
plt.plot([0,1], [0,1], color = 'k', linestyle='dashed', label='Line of perfect equality')
# plotting the Lorenz curve: cumulative share of actors (x) against cumulative share of gross earnings (y)
plt.scatter(np.arange(ACT_lorenz.size)/(ACT_lorenz.size-1), ACT_lorenz, label='Actual gross distribution')
# setting the limits of x-axis and y-axis
plt.xlim([0,1])
plt.ylim([0,1])
# adding annotations
plt.xlabel('Cumulative share of actors from poorest to richest')
plt.ylabel('Cumulative share of gross earnings')
plt.title('LORENZ CURVE - ACTOR')
# labels are attached to the artists above, so the legend does not depend on drawing order
plt.legend(loc="upper left",
           prop={"size":13})
# plotting the plot
plt.show();
Conclusion: the graphical representation, together with the Gini coefficient of 0.734 for the distribution of gross earnings among actors, indicates a relatively large gap between the most successful actors and the least successful ones. In other words, a minority of the most successful actors accounts for a large proportion of the wealth (gross earnings), while the earnings of the majority of less successful actors are far smaller by comparison. As with the distribution among directors, actors are likely to face tough competition from their peers and to have fairly small chances of becoming widely known in the movie industry.
First, an overview of imdb_score, gross and movie_facebook_likes is given to show how these variables are distributed, so that a comparison of any two given movies can be placed in the bigger picture of these three distributions.
df[['imdb_score', 'gross', 'movie_facebook_likes']].describe()
bin_size = 35
df['imdb_score'].hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['imdb_score'].median(), color = 'r', linestyle='dashed', linewidth=2)
plt.legend(['Median Score = ' + str(df['imdb_score'].median()),
'Distribution of IMDB scores'],
loc="upper left",
prop={"size":13})
plt.title('IMDB SCORE DISTRIBUTION (Histogram)')
plt.xlabel('IMDB score')
plt.ylabel('Count')
plt.show();
bin_size = 50
(df['gross']/1e6).hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['gross'].median()/1e6, color = 'r', linestyle='dashed', linewidth=2)
plt.legend(['Median Gross = ' + str(df['gross'].median()/1e6),
'Distribution of Gross Earnings'],
loc="upper right",
prop={"size":13})
plt.title('GROSS EARNING DISTRIBUTION (Histogram)')
plt.xlabel('Gross Earnings (Million US Dollars)')
plt.ylabel('Count')
plt.xlim((0,800))
plt.show();
bin_size = 30
df['movie_facebook_likes'].hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['movie_facebook_likes'].median(), color = 'r', linestyle='dashed', linewidth=2)
plt.legend(['Median Facebook Likes = ' + str(df['movie_facebook_likes'].median()),
'Distribution of Movie Facebook Likes'],
loc="upper right", prop={"size":13})
plt.title('MOVIE FACEBOOK LIKES DISTRIBUTION (Histogram)')
plt.xlabel('No. of Movie Facebook Likes')
plt.ylabel('Count')
plt.xlim((0,350000))
plt.show();
Conclusion:
In this case, Avatar and Transcendence are chosen for comparison, and each is located on the distributions of imdb_score, gross and movie_facebook_likes. A short conclusion based on the graphical representations and the figures is then drawn.
movie1 = 'Avatar'
movie2 = 'Transcendence'
movie1_color = 'orange'
movie2_color = 'deepskyblue'
comp_IMDB = df[['movie_title', 'imdb_score']][(df['movie_title'] == movie1) | (df['movie_title'] == movie2)]
comp_IMDB
ax = sns.barplot(x = 'movie_title', y = 'imdb_score', data = comp_IMDB, palette = (movie1_color, movie2_color))
ax.figure.set_size_inches(8, 6)
plt.title('IMDB score comparison between ' + movie1 + ' and ' + movie2)
plt.show();
bin_size = 35
df['imdb_score'].hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['imdb_score'].median(), color = 'r', linestyle='dashed', linewidth=2)
plt.axvline(x = comp_IMDB.iloc[0,1], color = movie1_color, linestyle='dashed', linewidth=2)
plt.axvline(x = comp_IMDB.iloc[1,1], color = movie2_color, linestyle='dashed', linewidth=2)
plt.legend(['Median Score = ' + str(df['imdb_score'].median()),
movie1 + ' = ' + str(comp_IMDB.iloc[0,1]),
movie2 + ' = ' + str(comp_IMDB.iloc[1,1]),
'Distribution of IMDB scores'],
loc="upper left",
prop={"size":13})
plt.title('IMDB SCORE DISTRIBUTION (Histogram)')
plt.xlabel('IMDB score')
plt.ylabel('Count')
plt.show();
Conclusion: with regard to IMDB scores, Avatar performs better than Transcendence, with scores of 7.9 and 6.3 respectively. Moreover, Avatar could be categorised as a "good" movie, since its score is above the median of 6.6. Transcendence, by contrast, may have drawn more criticism, as its score is below this median "threshold".
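Comparing a score with the median only says on which side of the middle a movie falls. A slightly finer measure is the share of movies scoring at or below a given value. The sketch below illustrates this on made-up scores; the helper `share_at_or_below` is a hypothetical name introduced here, and the toy list is not the dataset's imdb_score column.

```python
import numpy as np

def share_at_or_below(values, x):
    """Fraction of observations with a value less than or equal to x."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values <= x))

# toy scores, made up for illustration
toy_scores = [5.1, 6.0, 6.3, 6.6, 6.9, 7.2, 7.9, 8.4]

high = share_at_or_below(toy_scores, 7.9)
low = share_at_or_below(toy_scores, 6.3)
print(high, low)
```

Applied to the real column, the same call would take `df['imdb_score']` as its first argument.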
# NB. gross in million us dollars
# .copy() so that the rescaling below modifies an independent dataframe
comp_gross = df[['movie_title', 'gross']][(df['movie_title'] == movie1) | (df['movie_title'] == movie2)].copy()
comp_gross['gross'] = comp_gross['gross']/1e6
comp_gross
ax = sns.barplot(x = 'movie_title', y = 'gross', data = comp_gross, palette = (movie1_color, movie2_color))
ax.figure.set_size_inches(8, 6)
plt.title('Gross earnings comparison between ' + movie1 + ' and ' + movie2)
plt.ylabel('gross earnings (million US dollars)')
plt.show();
bin_size = 50
(df['gross']/1e6).hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['gross'].median()/1e6, color = 'r', linestyle='dashed', linewidth=2)
plt.axvline(x = comp_gross.iloc[0,1], color = movie1_color, linestyle='dashed', linewidth=2)
plt.axvline(x = comp_gross.iloc[1,1], color = movie2_color, linestyle='dashed', linewidth=2)
plt.legend(['Median Gross = ' + str(df['gross'].median()/1e6),
movie1 + ' = ' + str(comp_gross.iloc[0,1]),
movie2 + ' = ' + str(comp_gross.iloc[1,1]),
'Distribution of Gross Earnings'],
loc="upper right",
prop={"size":13})
plt.title('GROSS EARNING DISTRIBUTION (Histogram)')
plt.xlabel('Gross Earnings (Million US Dollars)')
plt.ylabel('Count')
plt.xlim((0,800))
plt.show();
Conclusion: in terms of gross earnings, Avatar is the obvious winner over Transcendence. It comes close to the record for the highest gross earnings, whereas Transcendence's gross is even below the median gross. This huge gap might indicate, to a certain extent, how polarised the movie industry has become in terms of gross earnings.
comp_movie_FB_likes = df[['movie_title', 'movie_facebook_likes']][(df['movie_title'] == movie1) | (df['movie_title'] == movie2)]
comp_movie_FB_likes
ax = sns.barplot(x = 'movie_title', y = 'movie_facebook_likes', data = comp_movie_FB_likes, palette = (movie1_color, movie2_color))
ax.figure.set_size_inches(8, 6)
plt.title('Movie Facebook Likes comparison between ' + movie1 + ' and ' + movie2)
plt.ylabel('No. of Movie Facebook Likes')
plt.show();
bin_size = 30
df['movie_facebook_likes'].hist(bins = bin_size, figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = df['movie_facebook_likes'].median(), color = 'r', linestyle='dashed', linewidth=2)
plt.axvline(x = comp_movie_FB_likes.iloc[0,1], color = movie1_color, linestyle='dashed', linewidth=2)
plt.axvline(x = comp_movie_FB_likes.iloc[1,1], color = movie2_color, linestyle='dashed', linewidth=2)
plt.legend(['Median Facebook Likes = ' + str(df['movie_facebook_likes'].median()),
movie1 + ' = ' + str(comp_movie_FB_likes.iloc[0,1]),
movie2 + ' = ' + str(comp_movie_FB_likes.iloc[1,1]),
'Distribution of Movie Facebook Likes'],
loc="upper right", prop={"size":13})
plt.title('MOVIE FACEBOOK LIKES DISTRIBUTION (Histogram)')
plt.xlabel('No. of Movie Facebook Likes')
plt.ylabel('Count')
plt.xlim((0,350000))
plt.show();
Conclusion: in terms of popularity on the social platform, the two movies compete closely with each other in the number of Facebook likes for their movie pages. Both figures are surprisingly high in comparison with the median value.
year_start = 1927
year_end = 2016
# .copy() so that the conversions below modify an independent dataframe
selected_gross_by_year = df[['title_year','gross']][(df['title_year'] >= year_start) & (df['title_year'] <= year_end)].copy()
selected_gross_by_year['gross'] = selected_gross_by_year['gross']/1e6
# change 'title_year' to integer value
selected_gross_by_year['title_year'] = selected_gross_by_year['title_year'].astype('int64')
# plotting
ax = sns.countplot(data=selected_gross_by_year, x = 'title_year',color = 'slategrey')
ax.figure.set_size_inches(16, 6)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_axisbelow(True)
plt.title('Number of movies in individual years')
plt.xlabel('Year')
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.show();
Conclusion: as the graph above clearly shows, the dataset contains an imbalanced number of movies per year. This has a strong effect on how the distributions of gross earnings are shaped over the years. The statistics for older years may not be representative, since they are based on only a small number of movies and are therefore likely to be biased. Nevertheless, they can still serve as a reasonable reference for how the distribution evolved over a chosen period.
max_gross_by_year = selected_gross_by_year.groupby('title_year').max()['gross']
min_gross_by_year = selected_gross_by_year.groupby('title_year').min()['gross']
mean_gross_by_year = selected_gross_by_year.groupby('title_year').mean()['gross']
median_gross_by_year = selected_gross_by_year.groupby('title_year').median()['gross']
max_gross_by_year.plot(figsize = (16,6))
min_gross_by_year.plot()
median_gross_by_year.plot()
mean_gross_by_year.plot()
plt.xticks(np.arange(year_start, year_end + 1, 1), rotation = 90)
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.ylim((selected_gross_by_year['gross'].min() - 20, selected_gross_by_year['gross'].max() + 20))
plt.title('THE DISTRIBUTION OF GROSS EARNINGS DURING THE CHOSEN RANGE OF YEARS (' + str(year_start) +' - ' + str(year_end) +')')
plt.xlabel('Years')
plt.ylabel('Gross Earnings (Million US Dollars)')
plt.legend(['Max Gross Earning',
'Min Gross Earning',
'Median Gross Earning',
'Mean Gross Earning'],
loc="upper left",
prop={"size":13});
ax = sns.boxplot(x='title_year', y='gross', data=selected_gross_by_year)
ax.figure.set_size_inches(16,6)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_axisbelow(True)
plt.title('THE DISTRIBUTION OF GROSS EARNINGS DURING THE CHOSEN RANGE OF YEARS (' + str(year_start) +' - ' + str(year_end) +')')
plt.xlabel('Year')
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.ylabel('Gross Earnings (Million US Dollars)');
Conclusion: over the whole period from 1927 to 2016, there is no clear trend in the median and mean of the gross earnings. This can be explained by the imbalanced number of movies in each recorded year, as mentioned above. The most distinct feature, however, comes from the outliers, or maximum values, of each recorded year from 1975 to 2016: in both the line graph and the boxplot they show a clearly increasing trend.
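Because the number of movies per year matters so much for reading these statistics, it can help to see the count next to the minimum, maximum, mean and median. The following sketch does this in a single `groupby(...).agg(...)` call on made-up values; it is an illustrative alternative to the four separate groupby calls, not the code used to produce the plots.

```python
import pandas as pd

# toy data: hypothetical years and gross values in million US dollars
toy = pd.DataFrame({
    'title_year': [1980, 1980, 2010, 2010, 2010, 2010],
    'gross': [10.0, 30.0, 5.0, 15.0, 25.0, 355.0],
})

# one row per year: number of movies plus the four statistics
summary = toy.groupby('title_year')['gross'].agg(['count', 'min', 'max', 'mean', 'median'])
print(summary)
```

In this toy table both years share the same median, while the single very large value in 2010 lifts that year's maximum and mean, which mirrors the pattern described for the outliers.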
# creating a set of genres, i.e. 'genre_set', in the alphabetical order
test = df['genres'].str.split('|')
genre_set = set()
for i in test:
    for j in i:
        genre_set.add(j)
genre_set = sorted(genre_set)
for i in genre_set:
    print(i)
# counting the number of movies whose 'genres' string contains each genre
temp = dict()
for i in genre_set:
    r = df['genres'].str.contains(i)
    temp[i] = r.sum()
k = pd.Series(temp)
plot3 = k.plot.bar(rot=90, title= 'BARPLOT OF NUMBER OF MOVIES FOR EACH MOVIE GENRE',
figsize = (16, 5))
plot3.set_xlabel("Movie Genre")
plot3.set_ylabel("Number of Movies")
plot3.grid(linestyle='dashed', linewidth='0.5', color='grey')
plot3.set_axisbelow(True);
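One caveat of counting genres with `str.contains` is that it matches substrings. If the dataset contains genre names where one is a prefix of another (for example Music and Musical, which is an assumption about the data and not verified here), the shorter genre would also count the movies of the longer one. The sketch below shows the effect on made-up rows and an exact alternative based on `str.get_dummies`.

```python
import pandas as pd

# toy genre strings in the same '|'-separated layout (made-up rows)
toy_genres = pd.Series(['Drama|Music', 'Comedy|Musical', 'Drama|Romance'])

# substring matching: 'Music' also matches the row that only has 'Musical'
substring_count = int(toy_genres.str.contains('Music').sum())

# exact matching: one indicator column per genre after splitting on '|'
indicators = toy_genres.str.get_dummies(sep='|')
exact_counts = indicators.sum()

print(substring_count)
print(exact_counts)
```

The indicator columns can also serve as boolean masks, for example `indicators['Music'] == 1`, in place of the `str.contains` results.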
# creating a list of imdb scores for each of movie genre, i.e. 'imdb_of_all_genres'
imdb_of_all_genres = []
for i in genre_set:
    r = df['genres'].str.contains(i)
    imdb_for_single_genre = df[r]['imdb_score']
    imdb_of_all_genres.append(imdb_for_single_genre)
plt.figure(figsize=(16,6))
box = plt.boxplot(imdb_of_all_genres, patch_artist = True)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
plt.yticks(np.arange(1,10.5, 0.5))
plt.xlabel('Movie Genre')
plt.ylabel('IMDB score')
plt.title('BOXPLOTS OF IMDB SCORES GROUPED BY DIFFERENT MOVIE GENRES')
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.show();
# creating a list of gross earnings (in million US dollars) for each of movie genre, i.e. 'gross_of_all_genres'
gross_of_all_genres = []
for i in genre_set:
    r = df['genres'].str.contains(i)
    # dividing by 1e6 so that the values match the axis label 'Million US Dollars'
    gross_for_single_genre = df[r]['gross']/1e6
    gross_of_all_genres.append(gross_for_single_genre)
plt.figure(figsize=(16,6))
box = plt.boxplot(gross_of_all_genres, patch_artist = True, showfliers = False)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
plt.xlabel('Movie Genre')
plt.ylabel('Gross Earnings (Million US Dollars)')
plt.title('BOXPLOTS OF GROSS EARNINGS GROUPED BY DIFFERENT MOVIE GENRES (WITHOUT OUTLIERS)')
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.show();
# creating a list of movie facebook likes for each of movie genre, i.e. 'movie_FB_likes_of_all_genres'
movie_FB_likes_of_all_genres = []
for i in genre_set:
    r = df['genres'].str.contains(i)
    movie_FB_likes_for_single_genre = df[r]['movie_facebook_likes']
    movie_FB_likes_of_all_genres.append(movie_FB_likes_for_single_genre)
plt.figure(figsize=(16,6))
box = plt.boxplot(movie_FB_likes_of_all_genres, patch_artist = True, showfliers = False)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
plt.xlabel('Movie Genre')
plt.ylabel('Number of Movie Facebook Likes')
plt.title('BOXPLOTS OF MOVIE FACEBOOK LIKES GROUPED BY DIFFERENT MOVIE GENRES (WITHOUT OUTLIERS)')
plt.grid(color = 'gainsboro', linestyle='dashed')
plt.show();
# selecting a genre for analysis
genre = 'Romance'
# creating a pandas Series containing all IMDB scores belonging to the selected genre, i.e. 'genre'
result = df['genres'].str.contains(genre)
imdb_of_selected_genre = df[result]['imdb_score']
imdb_of_selected_genre.hist(figsize = (16,6), color = 'slategrey').grid(False)
plt.axvline(x = imdb_of_selected_genre.mean(), color = 'red', linestyle='dashed', linewidth=2)
plt.axvline(x = imdb_of_selected_genre.median(), color = 'limegreen', linestyle='dashed', linewidth=2)
plt.legend(['Mean Score = ' + str(round(imdb_of_selected_genre.mean(), 2)),
'Median Score = ' + str(round(imdb_of_selected_genre.median(), 2))],
loc="upper left",
prop={"size":13})
plt.title('IMDB SCORE DISTRIBUTION OF ' + genre.upper() + ' MOVIES')
plt.xlabel('IMDB score')
plt.ylabel('Count')
plt.xticks(np.arange(0,10.5, 0.5), rotation = 0)
plt.show();
imdb_of_selected_genre.describe()
plt.figure(figsize=(16,6))
box = plt.boxplot(imdb_of_all_genres, patch_artist = True)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
plt.yticks(np.arange(1,10.5, 0.5))
plt.xlabel('Movie Genre')
plt.ylabel('IMDB score')
plt.title('BOXPLOTS OF IMDB SCORES WITH ' + genre.upper()+ ' MOVIES (RED) AND OTHER MOVIE GENRES')
plt.grid(color = 'gainsboro',linestyle='dashed')
colors = ['silver'] * len(genre_set)
colors[genre_set.index(genre)] = 'red'
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show();
# counting the number of movies whose 'genres' string contains each genre
temp = dict()
for i in genre_set:
    r = df['genres'].str.contains(i)
    temp[i] = r.sum()
k = pd.Series(temp)
plot4 = k.plot.bar(rot=90, title='BARPLOT OF NUMBER OF MOVIES FOR EACH MOVIE GENRE - ' + genre.upper() + ' (RED)',
figsize = (16, 6), color = colors)
plot4.set_xlabel("Movie Genre")
plot4.set_ylabel("Number of Movies")
plot4.grid(linestyle='dashed', linewidth='0.5', color='grey')
plot4.set_axisbelow(True);
plt.figure(figsize=(16,6))
box = plt.boxplot(gross_of_all_genres, patch_artist = True, showfliers = False)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
#plt.yticks(np.arange(1,10.5, 0.5))
plt.xlabel('Movie Genre')
plt.ylabel('Gross Earnings (Million US Dollars)')
plt.title('BOXPLOTS OF GROSS EARNINGS WITH DIFFERENT MOVIE GENRES (WITHOUT OUTLIERS) - ' + genre.upper() + ' (RED)')
plt.grid(color = 'gainsboro', linestyle='dashed')
colors = ['silver'] * len(genre_set)
colors[genre_set.index(genre)] = 'red'
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show();
plt.figure(figsize=(16,6))
box = plt.boxplot(movie_FB_likes_of_all_genres, patch_artist = True, showfliers = False)
plt.xticks(np.arange(1,len(genre_set)+1, 1), genre_set, rotation = 90)
plt.xlabel('Movie Genre')
plt.ylabel('Number of Movie Facebook Likes')
plt.title('BOXPLOTS OF MOVIE FACEBOOK LIKES WITH DIFFERENT MOVIE GENRES (WITHOUT OUTLIERS) - ' + genre.upper() + ' (RED)')
plt.grid(color = 'gainsboro', linestyle='dashed')
colors = ['silver'] * len(genre_set)
colors[genre_set.index(genre)] = 'red'
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show();
temp = df[['duration',
'director_facebook_likes',
'num_user_for_reviews',
'num_critic_for_reviews',
'num_voted_users',
'cast_total_facebook_likes',
'movie_facebook_likes',
'facenumber_in_poster',
'title_year',
'aspect_ratio',
'gross',
'budget',
'imdb_score']]
sns.set(style="ticks")
ax = sns.pairplot(temp)
plt.show();
ax = sns.heatmap(round(temp.corr(),2),
cmap="YlGnBu",
cbar_kws={'ticks': [-1.0, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0]},
vmin=-1,
vmax=1,
linewidths=.5,
annot=True)
ax.set_title('HEATMAP PLOT - PEARSON CORRELATION COEFFICIENTS BETWEEN VARIABLES')
ax.figure.set_size_inches(13,10)
plt.show();
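A heatmap with many variables can be hard to read for one particular target. As a rough sketch, the column of the correlation matrix belonging to the target can be extracted and sorted instead. The data below are synthetic and the column names `score`, `votes` and `faces` are hypothetical stand-ins: `votes` is constructed to move with `score`, while `faces` is constructed as unrelated noise.

```python
import numpy as np
import pandas as pd

# synthetic data with a fixed seed so that the example is reproducible
rng = np.random.default_rng(0)
score = rng.normal(6.5, 1.0, 500)
toy = pd.DataFrame({
    'score': score,
    'votes': 1000 * score + rng.normal(0, 500, 500),  # built to correlate with score
    'faces': rng.integers(0, 5, 500),                 # built to be unrelated
})

# Pearson correlation of every other column with the target, strongest first
corr_with_score = toy.corr()['score'].drop('score').sort_values(ascending=False)
print(corr_with_score.round(2))
```

With the dataframe of this report, the same expression would read `temp.corr()['imdb_score'].drop('imdb_score').sort_values(ascending=False)`.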
temp['imdb_score'].describe()
sns.set_style("whitegrid")
sns.pairplot(temp,
x_vars=["duration", "director_facebook_likes", "num_user_for_reviews", "num_critic_for_reviews"],
y_vars=["imdb_score"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'navy'}},
height=4, aspect=1, kind="reg")
plt.suptitle('PAIRPLOTS OF IMDB SCORES WITH OTHER VARIABLES', size = 20)
sns.pairplot(temp,
x_vars=["num_voted_users", "cast_total_facebook_likes", "movie_facebook_likes", "facenumber_in_poster"],
y_vars=["imdb_score"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'navy'}},
height=4, aspect=1, kind="reg");
sns.pairplot(temp,
x_vars=["title_year", "aspect_ratio", "gross", "budget"],
y_vars=["imdb_score"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'navy'}},
height=4, aspect=1, kind="reg");
temp['gross'].describe()
sns.set_style("whitegrid")
sns.pairplot(temp,
x_vars=["duration", "director_facebook_likes", "num_user_for_reviews", "num_critic_for_reviews"],
y_vars=["gross"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'darkolivegreen'}},
height=4, aspect=1, kind="reg")
plt.suptitle('PAIRPLOTS OF GROSS EARNINGS WITH OTHER VARIABLES', size = 20)
sns.pairplot(temp,
x_vars=["num_voted_users", "cast_total_facebook_likes", "movie_facebook_likes", "facenumber_in_poster"],
y_vars=["gross"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'darkolivegreen'}},
height=4, aspect=1, kind="reg");
sns.pairplot(temp,
x_vars=["title_year", "aspect_ratio", "budget", "imdb_score"],
y_vars=["gross"],
plot_kws={'line_kws':{'color':'red'}, 'scatter_kws':{"s": 10, 'alpha':0.3, 'color': 'darkolivegreen'}},
height=4, aspect=1, kind="reg");
temp.head()
X = temp.iloc[:, :-1].values
y = temp.iloc[:, -1].values
X.shape
import statsmodels.api as sm
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)
X_opt = X[:,[0,1,2,3,4,5,6,7,8,9,10,11,12]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
X_opt = X[:,[0,1,3,4,5,7,8,9,11]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
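Since the regression is fitted on a plain NumPy array, the summary labels the predictors only by position. The short sketch below maps the column indices used in the second model back to variable names, following the column order of temp with the column of ones prepended at index 0. The reason for dropping the remaining variables is not stated here; a plausible reading, offered as an assumption, is backward elimination based on the p-values of the first summary.

```python
# column order of temp as defined earlier; the last column, imdb_score, is the target y
predictors = ['duration', 'director_facebook_likes', 'num_user_for_reviews',
              'num_critic_for_reviews', 'num_voted_users', 'cast_total_facebook_likes',
              'movie_facebook_likes', 'facenumber_in_poster', 'title_year',
              'aspect_ratio', 'gross', 'budget']

# index 0 is the column of ones (intercept) prepended to X
design_columns = ['const'] + predictors

# indices used for X_opt in the second model
kept_idx = [0, 1, 3, 4, 5, 7, 8, 9, 11]
kept = [design_columns[i] for i in kept_idx]
dropped = [name for i, name in enumerate(design_columns) if i not in kept_idx]

print(kept)
print(dropped)
```

The second model therefore keeps the intercept and eight predictors, and leaves out director_facebook_likes, cast_total_facebook_likes, aspect_ratio and budget.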